Load the dataset from a CSV file and Create DataFrame¶

Print the first few rows of the DataFrame to quickly inspect the data

In [1]:
import pandas as pd

# Load the downloaded CSV file
data = pd.read_csv('knn_authentication_features.csv')
print("Loaded Data:")
data.head()
Loaded Data:
Out[1]:
filename label student_id Mfcc_1 Mfcc_2 Mfcc_3 Mfcc_4 Mfcc_5 Mfcc_6 Mfcc_7 ... Spectral_Contrast_2 Spectral_Contrast_3 Spectral_Contrast_4 Spectral_Contrast_5 Spectral_Contrast_6 Spectral_Contrast_7 Zero_Crossing_Rate RMS_Energy Spectral_Centroid Spectral_Bandwidth
0 hw1_q2_610399205_male.mp3_0 male 610399205 -94.829977 122.175083 -68.000374 47.507958 -14.407363 -13.427573 -23.064053 ... 16.537672 17.337876 15.638074 18.102693 23.265461 66.736837 0.105971 [0.24632083] 1911.823347 1581.462113
1 hw1_q2_610399205_male.mp3_1 male 610399205 -123.197408 131.659141 -61.479495 44.792372 -7.927916 -10.119547 -21.518692 ... 16.173748 17.615650 15.370934 17.944107 21.722387 64.664952 0.079931 [0.21608743] 1656.721908 1499.382229
2 hw1_q2_610399205_male.mp3_2 male 610399205 -122.910702 141.014834 -67.712628 39.928565 -14.785694 -7.251495 -24.902940 ... 15.759529 17.226226 15.636203 18.222485 21.671965 63.307834 0.096313 [0.19001988] 1707.593297 1448.149975
3 hw1_q2_610399205_male.mp3_3 male 610399205 -140.025639 126.584783 -53.874062 40.963797 -5.784167 -5.752144 -22.515332 ... 15.341300 15.522699 14.462609 18.687998 22.817218 61.284887 0.099141 [0.16602059] 1857.183068 1591.809683
4 hw1_q2_610399205_male.mp3_4 male 610399205 -127.307332 122.204455 -61.316071 54.882303 -0.544779 -11.455723 -28.317545 ... 17.254408 17.243651 14.808253 19.110663 23.740752 67.221951 0.096767 [0.21153131] 1830.105791 1537.769633

5 rows × 27 columns

In [3]:
list(data.columns)
Out[3]:
['filename',
 'label',
 'student_id',
 'Mfcc_1',
 'Mfcc_2',
 'Mfcc_3',
 'Mfcc_4',
 'Mfcc_5',
 'Mfcc_6',
 'Mfcc_7',
 'Mfcc_8',
 'Mfcc_9',
 'Mfcc_10',
 'Mfcc_11',
 'Mfcc_12',
 'Mfcc_13',
 'Spectral_Contrast_1',
 'Spectral_Contrast_2',
 'Spectral_Contrast_3',
 'Spectral_Contrast_4',
 'Spectral_Contrast_5',
 'Spectral_Contrast_6',
 'Spectral_Contrast_7',
 'Zero_Crossing_Rate',
 'RMS_Energy',
 'Spectral_Centroid',
 'Spectral_Bandwidth']

Initially, the values in the RMS_Energy column are stored as strings, and they contain square brackets. This format is not suitable for numerical analysis, and hence, we need to clean and convert the data.

In [4]:
data['RMS_Energy'] = data['RMS_Energy'].astype(str).str.replace('[', '', regex=False).str.replace(']', '', regex=False).astype(float)

Utills¶

Required Functions


K_fold_Crossvalidation Function:¶


  • This function provides an efficient way to compare classification models and evaluate various features and feature reduction techniques.

  • Using the KFold class from sklearn.model_selection, the data is split into k folds (default is 5). This ensures that each fold is used as a test set once, and as part of the training set k-1 times.

  • Standard scaling is applied to the feature sets using StandardScaler to normalize the data

  • If a feature reduction function is provided, we fit it using the training data and then apply it to the test data. This step ensures that the feature reduction technique is appropriately trained on the training set before being applied to unseen test data.

  • The average accuracy score across all folds are calculated to provide an overall performance metric. Additionally, the average confusion matrix is plotted to visualize the performance.

In [5]:
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

def k_fold_cross_validation(df, feature_names, target_name, model, feature_reduction_func, k=5):
    """
    Perform k-fold cross-validation on the given data.

    Parameters:
    df (pd.DataFrame): The input data frame.
    feature_names (list): List of feature column names.
    target_name (str): The name of the target column.
    model: The classification model to be used.
    feature_reduction_func: The feature reduction function to be applied.
    k (int): The number of folds for cross-validation (default is 5).

    Returns:
    float: The average accuracy score across all folds.
    """
    X = df[feature_names]
    y = df[target_name]

    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    accuracies = []
    confusion_matrices = []

    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        # Feature scaling
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)

        # Feature reduction
        if(feature_reduction_func):
          X_train_scaled = feature_reduction_func.fit_transform(X_train_scaled, y_train)
          X_test_scaled = feature_reduction_func.transform(X_test_scaled)


        model.fit(X_train_scaled, y_train)

        # Make predictions

        #test_accuracy = model.evaluate(X_test_scaled, y_test)
        y_pred = model.predict(X_test_scaled)

        # Calculate accuracy
        accuracy = accuracy_score(y_test, y_pred)
        accuracies.append(accuracy)

        # Generate confusion matrix
        conf_matrix = confusion_matrix(y_test, y_pred)
        confusion_matrices.append(conf_matrix)

    # Calculate average accuracy
    avg_accuracy = np.mean(accuracies)*100
    std_accuracy = np.std(accuracies)*100

    # Plot the average confusion matrix
    avg_conf_matrix = np.mean(confusion_matrices, axis=0)
    plt.figure(figsize=(5, 3))
    sns.heatmap(avg_conf_matrix, annot=True, fmt='.2f', cmap='Blues')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Average Confusion Matrix')
    plt.show()

    return avg_accuracy, std_accuracy

Visualization of Cross-Validation Results¶

Plot_Results_Barchart¶


To effectively compare the performance of various classification models and feature reduction techniques, we utilize a bar chart visualization. The Plot_Results_Barchart function is designed to generate and display these bar charts, providing clear and intuitive insights into the cross-validation results.

For each feature reduction method, the function filters the results and plots a bar chart showing the mean accuracy of different models for each feature set. The maximum accuracy for each feature reduction method is printed, and bars with the highest accuracy are highlighted with red edges.

In [6]:
def Plot_Results_Barchart(results_df):

  # List of feature reduction methods
  feature_reductions = results_df['Feature Reduction'].unique()

  # Set a pleasing color palette
  sns.set_palette("Set2")

  # Loop to plot bar charts for each feature reduction method
  for reduction in feature_reductions:
      # Filter data for the current feature reduction method
      filtered_results = results_df[results_df['Feature Reduction'] == reduction]
      max_value = filtered_results['Mean Accuracy'].max()
      print(f"Maximum accuracy if feature reduction({reduction}): {max_value:.2f}%")

      # Plot the bar chart
      plt.figure(figsize=(7, 5))
      ax = sns.barplot(x='Feature Set', y='Mean Accuracy', hue='Model', data=filtered_results, errorbar=None, width=0.4)


      # Add solid lines around each bar
      for bar in ax.patches:
          if bar.get_height() == max_value:
              bar.set_edgecolor('red')
              bar.set_linewidth(2)
          else:
              bar.set_edgecolor('black')
              bar.set_linewidth(0.5)

      # Set titles and labels
      plt.title(f'Cross-Validation Mean Accuracy for Different Models ({reduction})', fontsize=10)
      plt.xlabel('Feature Set', fontsize=8)
      plt.ylabel('Mean Accuracy', fontsize=8)
      plt.ylim(0, 100)
      plt.xticks(fontsize=8)
      plt.yticks(fontsize=8)
      plt.legend(title='Model', loc='upper center', bbox_to_anchor=(0.5, -0.2), ncol=3, fontsize=8)

      plt.tight_layout()
      plt.show()

Classification Using Different Models, Feature Reduction, and Feature Sets¶

Run_Classification¶


The Run_Classification function is a comprehensive approach designed to evaluate the performance of various classification models using different feature sets and feature reduction techniques. This function leverages k-fold cross-validation(k=4) to provide robust performance metrics for each combination of model, feature set, and feature reduction method.

  • Various classification models are defined for evaluation:

    • K-Nearest Neighbors (KNN)
    • Logistic Regression
    • Support Vector Machine (SVM)
  • Feature Reduction Techniques:

    • Linear Discriminant Analysis (LDA)
    • Principal Component Analysis (PCA)
    • No feature reduction (None)
  • Defining Feature Sets:

    • mfcc_features: Mel-Frequency Cepstral Coefficients (MFCC) features.
    • spectral_contrast_features: Spectral Contrast features.
    • combined_features: A combination of MFCC and Spectral Contrast features.
    • time_features: Time-domain features such as Zero Crossing Rate and RMS Energy.
    • Spec_cent_Bw: Spectral Centroid and Spectral Bandwidth features.
    • all_features: A combination of all the above feature sets.

The results DataFrame, containing the performance metrics for each combination.

In [7]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC


from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def Run_Classification(balanced_dataFrame):
  # Define the feature sets
  mfcc_features = [f'Mfcc_{i+1}' for i in range(13)]
  spectral_contrast_features = [f'Spectral_Contrast_{i+1}' for i in range(7)]
  combined_features = mfcc_features + spectral_contrast_features
  time_features = ['Zero_Crossing_Rate',	'RMS_Energy']
  Spec_cent_Bw = ['Spectral_Centroid',	'Spectral_Bandwidth']
  all_features = combined_features + time_features + Spec_cent_Bw

  balanced_dataFrame['RMS_Energy'] = balanced_dataFrame['RMS_Energy'].astype(str).str.replace('[', '', regex=False).str.replace(']', '', regex=False).astype(float)

  # Define the target column
  target_column = 'class'

  # Define the models
  models = {
      'KNN': KNeighborsClassifier(n_neighbors=5),
      'Logistic Regression': LogisticRegression( max_iter=200, random_state= 42),
      'SVM': SVC(kernel='linear', random_state=42)
  }

  # Define the feature sets
  feature_sets = {
      'time_domain_features' : time_features,
      'MFCC': mfcc_features,
      'Spectral Contrast': spectral_contrast_features,
      'MFCC&Sp_Contrast': combined_features,
      'Spec_Cent&BW': Spec_cent_Bw,
      'all_features' : all_features
  }

  feature_reductions = {
      'LDA': LinearDiscriminantAnalysis(),
      'PCA': PCA(),
      'None': None
  }

  # Run cross-validation and store results
  results = []
  for model_name, model in models.items():
      for feature_set_name, feature_columns in feature_sets.items():
          for reduction_name, reduction_func in feature_reductions.items():
              print(f'\nModel({model_name}) using feature({feature_set_name}) and feature reduction({reduction_name})\n')
              mean_accuracy, std_accuracy = k_fold_cross_validation(balanced_dataFrame, feature_columns, target_column, model, reduction_func, k=4)
              results.append({
                  'Model': model_name,
                  'Feature Set': feature_set_name,
                  'Feature Reduction': reduction_name,
                  'Mean Accuracy': mean_accuracy,
                  'Std Accuracy': std_accuracy
                  })

  results_df = pd.DataFrame(results)
  return results_df

Random Student Set1¶

Select 6 random Student and Ensure each student has an equal number of samples and visualize some features using pairplots.

Random Student Selection: We begin by randomly selecting 6 students from the dataset. This ensures that our analysis is not biased towards any specific subset of students.

In [8]:
import random

random.seed(42)

# Get unique student IDs
student_ids = data['student_id'].unique()

# Randomly select 6 students
selected_students = random.sample(list(student_ids), 6)
print(f"Selected Students: {selected_students}")
Selected Students: [np.int64(810103183), np.int64(810100091), np.int64(810198554), np.int64(810103317), np.int64(810199489), np.int64(810101465)]

Equal Number of Samples: For each of the selected students, we ensure that an equal number of samples are used for further analysis. This step is crucial to maintain consistency and fairness in our analysis.

In [9]:
from sklearn.utils import shuffle

filtered_data = data[data['student_id'].isin(selected_students)]

# Ensure each student has an equal number of samples
balanced_data = filtered_data.groupby('student_id').apply(lambda x: x.sample(n=filtered_data['student_id'].value_counts().min(), random_state=42)).reset_index(drop=True)
balanced_data = shuffle(balanced_data, random_state=42)
balanced_data.reset_index(drop=True, inplace=True)

print("Number of samples for each student in the balanced DataFrame:")
print(balanced_data['student_id'].value_counts())
Number of samples for each student in the balanced DataFrame:
student_id
810103183    123
810101465    123
810198554    123
810199489    123
810103317    123
810100091    123
Name: count, dtype: int64
/tmp/ipykernel_15412/3536119780.py:6: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
  balanced_data = filtered_data.groupby('student_id').apply(lambda x: x.sample(n=filtered_data['student_id'].value_counts().min(), random_state=42)).reset_index(drop=True)

Pairplot Visualization:¶

Pairplots are used to visualize the relationships between different features for the selected students. This provides an intuitive and visual understanding of the data distribution and correlations.

MFCC features

In [10]:
import seaborn as sns


mfcc_features = [f'Mfcc_{i+1}' for i in range(13)]
# Plot pair plot
palette = sns.color_palette("tab10", balanced_data['student_id'].nunique())
sns.pairplot(balanced_data, hue='student_id', vars= mfcc_features, palette=palette)

# Display the plot
plt.show()
No description has been provided for this image

Spectral_Contrast features

In [11]:
import seaborn as sns

spectral_contrast_features = [f'Spectral_Contrast_{i+1}' for i in range(7)]
# Plot pair plot
palette = sns.color_palette("tab10", balanced_data['student_id'].nunique())
sns.pairplot(balanced_data, hue='student_id', vars= spectral_contrast_features, palette=palette)

# Display the plot
plt.show()
No description has been provided for this image

Time domain features: Zero Crossing Rate and RMS Energy

In [12]:
import seaborn as sns

time_features = ['Zero_Crossing_Rate',	'RMS_Energy']
# Plot pair plot
palette = sns.color_palette("tab10", balanced_data['student_id'].nunique())
sns.pairplot(balanced_data, hue='student_id', vars= time_features, palette=palette)

# Display the plot
plt.show()
No description has been provided for this image

Classification and Analysis for Student Set1¶

Mapping Student IDs to Class Labels: We create a mapping from student IDs to class labels. Each selected student is assigned a unique class label, starting from 0 up to the number of selected students minus one.

In [13]:
# Map student IDs to class labels (e.g., 0 to 5)
class_mapping = {student_id: idx for idx, student_id in enumerate(selected_students)}
balanced_data['class'] = balanced_data['student_id'].map(class_mapping)


print("Balanced DataFrame with Class Column:")
balanced_data.tail()
Balanced DataFrame with Class Column:
Out[13]:
filename label student_id Mfcc_1 Mfcc_2 Mfcc_3 Mfcc_4 Mfcc_5 Mfcc_6 Mfcc_7 ... Spectral_Contrast_3 Spectral_Contrast_4 Spectral_Contrast_5 Spectral_Contrast_6 Spectral_Contrast_7 Zero_Crossing_Rate RMS_Energy Spectral_Centroid Spectral_Bandwidth class
733 hw1_q1_810100091_male.mp3_14 male 810100091 -209.270224 159.299467 -12.862626 38.995735 -8.071367 1.210875 -26.295211 ... 18.579838 16.371176 18.578580 18.729660 57.705217 0.045901 0.146539 1065.860536 1343.465670 1
734 hw1_q4_810100091_male.mp3_35 male 810100091 -180.091125 158.623054 -8.343020 38.221878 0.616179 6.856343 -26.354627 ... 17.178651 14.908284 17.176830 17.841851 58.990186 0.047214 0.206644 1160.097436 1424.149759 1
735 hw1_q4_810103183_male.mp3_2 male 810103183 -197.288841 153.435223 -21.336439 39.097099 -28.892864 -4.788617 -17.154573 ... 18.552319 16.887290 18.161144 22.510156 58.677515 0.074786 0.160438 1388.911960 1346.191071 0
736 hw1_q4_810103317_male.mp3_27 male 810103317 -196.783701 156.871713 2.812224 30.233925 4.374440 -1.058855 -26.090089 ... 20.248830 16.543377 19.366785 18.958355 58.715747 0.054211 0.242592 1063.924159 1260.229348 3
737 hw1_q1_810100091_male.mp3_9 male 810100091 -161.626642 153.208495 -11.947535 47.287845 -4.489833 -2.877725 -31.232691 ... 19.821527 17.225590 17.476504 18.599111 60.334727 0.049144 0.271515 1141.883666 1375.061913 1

5 rows × 28 columns

Classification Process and Average Confusion Matrix¶

Calling the Classification Function¶

The Run_Classification function is called with the balanced dataset (balanced_data) as its input. This function performs k-fold cross-validation for various combinations of classification models, feature sets, and feature reduction techniques. The Result_DF DataFrame now contains detailed performance metrics for various combinations of models, feature sets, and feature reduction techniques. This structured summary facilitates easy comparison and selection of the most effective approaches for the classification task.

In [14]:
Result_DF = Run_Classification(balanced_data)
Model(KNN) using feature(time_domain_features) and feature reduction(LDA)

No description has been provided for this image
Model(KNN) using feature(time_domain_features) and feature reduction(PCA)

No description has been provided for this image
Model(KNN) using feature(time_domain_features) and feature reduction(None)

No description has been provided for this image
Model(KNN) using feature(MFCC) and feature reduction(LDA)

No description has been provided for this image
Model(KNN) using feature(MFCC) and feature reduction(PCA)

No description has been provided for this image
Model(KNN) using feature(MFCC) and feature reduction(None)

No description has been provided for this image
Model(KNN) using feature(Spectral Contrast) and feature reduction(LDA)

No description has been provided for this image
Model(KNN) using feature(Spectral Contrast) and feature reduction(PCA)

No description has been provided for this image
Model(KNN) using feature(Spectral Contrast) and feature reduction(None)

No description has been provided for this image
Model(KNN) using feature(MFCC&Sp_Contrast) and feature reduction(LDA)

No description has been provided for this image
Model(KNN) using feature(MFCC&Sp_Contrast) and feature reduction(PCA)

No description has been provided for this image
Model(KNN) using feature(MFCC&Sp_Contrast) and feature reduction(None)

No description has been provided for this image
Model(KNN) using feature(Spec_Cent&BW) and feature reduction(LDA)

No description has been provided for this image
Model(KNN) using feature(Spec_Cent&BW) and feature reduction(PCA)

No description has been provided for this image
Model(KNN) using feature(Spec_Cent&BW) and feature reduction(None)

No description has been provided for this image
Model(KNN) using feature(all_features) and feature reduction(LDA)

No description has been provided for this image
Model(KNN) using feature(all_features) and feature reduction(PCA)

No description has been provided for this image
Model(KNN) using feature(all_features) and feature reduction(None)

No description has been provided for this image
Model(Logistic Regression) using feature(time_domain_features) and feature reduction(LDA)

No description has been provided for this image
Model(Logistic Regression) using feature(time_domain_features) and feature reduction(PCA)

No description has been provided for this image
Model(Logistic Regression) using feature(time_domain_features) and feature reduction(None)

No description has been provided for this image
Model(Logistic Regression) using feature(MFCC) and feature reduction(LDA)

No description has been provided for this image
Model(Logistic Regression) using feature(MFCC) and feature reduction(PCA)

No description has been provided for this image
Model(Logistic Regression) using feature(MFCC) and feature reduction(None)

No description has been provided for this image
Model(Logistic Regression) using feature(Spectral Contrast) and feature reduction(LDA)

No description has been provided for this image
Model(Logistic Regression) using feature(Spectral Contrast) and feature reduction(PCA)

No description has been provided for this image
Model(Logistic Regression) using feature(Spectral Contrast) and feature reduction(None)

No description has been provided for this image
Model(Logistic Regression) using feature(MFCC&Sp_Contrast) and feature reduction(LDA)

No description has been provided for this image
Model(Logistic Regression) using feature(MFCC&Sp_Contrast) and feature reduction(PCA)

No description has been provided for this image
Model(Logistic Regression) using feature(MFCC&Sp_Contrast) and feature reduction(None)

No description has been provided for this image
Model(Logistic Regression) using feature(Spec_Cent&BW) and feature reduction(LDA)

No description has been provided for this image
Model(Logistic Regression) using feature(Spec_Cent&BW) and feature reduction(PCA)

No description has been provided for this image
Model(Logistic Regression) using feature(Spec_Cent&BW) and feature reduction(None)

No description has been provided for this image
Model(Logistic Regression) using feature(all_features) and feature reduction(LDA)

No description has been provided for this image
Model(Logistic Regression) using feature(all_features) and feature reduction(PCA)

No description has been provided for this image
Model(Logistic Regression) using feature(all_features) and feature reduction(None)

No description has been provided for this image
Model(SVM) using feature(time_domain_features) and feature reduction(LDA)

No description has been provided for this image
Model(SVM) using feature(time_domain_features) and feature reduction(PCA)

No description has been provided for this image
Model(SVM) using feature(time_domain_features) and feature reduction(None)

No description has been provided for this image
Model(SVM) using feature(MFCC) and feature reduction(LDA)

No description has been provided for this image
Model(SVM) using feature(MFCC) and feature reduction(PCA)

No description has been provided for this image
Model(SVM) using feature(MFCC) and feature reduction(None)

No description has been provided for this image
Model(SVM) using feature(Spectral Contrast) and feature reduction(LDA)

No description has been provided for this image
Model(SVM) using feature(Spectral Contrast) and feature reduction(PCA)

No description has been provided for this image
Model(SVM) using feature(Spectral Contrast) and feature reduction(None)

No description has been provided for this image
Model(SVM) using feature(MFCC&Sp_Contrast) and feature reduction(LDA)

No description has been provided for this image
Model(SVM) using feature(MFCC&Sp_Contrast) and feature reduction(PCA)

No description has been provided for this image
Model(SVM) using feature(MFCC&Sp_Contrast) and feature reduction(None)

No description has been provided for this image
Model(SVM) using feature(Spec_Cent&BW) and feature reduction(LDA)

No description has been provided for this image
Model(SVM) using feature(Spec_Cent&BW) and feature reduction(PCA)

No description has been provided for this image
Model(SVM) using feature(Spec_Cent&BW) and feature reduction(None)

No description has been provided for this image
Model(SVM) using feature(all_features) and feature reduction(LDA)

No description has been provided for this image
Model(SVM) using feature(all_features) and feature reduction(PCA)

No description has been provided for this image
Model(SVM) using feature(all_features) and feature reduction(None)

No description has been provided for this image

Sorting and Displaying Top Results¶

By sorting the results DataFrame by mean accuracy and displaying the top rows, we efficiently identify the best-performing combinations of models, feature sets, and feature reduction techniques.

In [15]:
df_sorted = Result_DF.sort_values(by='Mean Accuracy', ascending=False)
df_sorted.head()
Out[15]:
Model Feature Set Feature Reduction Mean Accuracy Std Accuracy
15 KNN all_features LDA 99.729730 0.468122
34 Logistic Regression all_features PCA 99.459459 0.540541
51 SVM all_features LDA 99.459459 0.540541
33 Logistic Regression all_features LDA 99.459459 0.540541
35 Logistic Regression all_features None 99.459459 0.540541

Ploting Results¶

The line Plot_Results_Barchart(Result_DF) calls the Plot_Results_Barchart function and passes the Result_DF DataFrame as an argument. This function generates and displays bar charts that visually compare the performance of various classification models and feature reduction techniques based on the cross-validation results stored in Result_DF.

In [16]:
Plot_Results_Barchart(Result_DF)
Maximum accuracy if feature reduction(LDA): 99.73%
No description has been provided for this image
Maximum accuracy if feature reduction(PCA): 99.46%
No description has been provided for this image
Maximum accuracy if feature reduction(None): 99.46%
No description has been provided for this image

Random Student Set2¶

In [17]:
import random

random.seed(8)

# Get unique student IDs
student_ids = data['student_id'].unique()

# Randomly select 6 students
selected_students = random.sample(list(student_ids), 6)
print(f"Selected Students: {selected_students}")
Selected Students: [np.int64(810102148), np.int64(810100193), np.int64(810600133), np.int64(810103054), np.int64(810100206), np.int64(810100168)]
In [18]:
from sklearn.utils import shuffle

filtered_data = data[data['student_id'].isin(selected_students)]

# Ensure each student has an equal number of samples
balanced_data = filtered_data.groupby('student_id').apply(lambda x: x.sample(n=filtered_data['student_id'].value_counts().min(), random_state=42)).reset_index(drop=True)
balanced_data = shuffle(balanced_data, random_state=42)
balanced_data.reset_index(drop=True, inplace=True)

print("Number of samples for each student in the balanced DataFrame:")
print(balanced_data['student_id'].value_counts())
Number of samples for each student in the balanced DataFrame:
student_id
810600133    97
810100206    97
810100193    97
810103054    97
810102148    97
810100168    97
Name: count, dtype: int64
/tmp/ipykernel_15412/3536119780.py:6: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
  balanced_data = filtered_data.groupby('student_id').apply(lambda x: x.sample(n=filtered_data['student_id'].value_counts().min(), random_state=42)).reset_index(drop=True)

Classification and Analysis for Student Set2¶

In [19]:
# Map student IDs to class labels (e.g., 0 to 5)
class_mapping = {student_id: idx for idx, student_id in enumerate(selected_students)}
balanced_data['class'] = balanced_data['student_id'].map(class_mapping)

print("Balanced DataFrame with Class Column:")
balanced_data.tail()
Balanced DataFrame with Class Column:
Out[19]:
filename label student_id Mfcc_1 Mfcc_2 Mfcc_3 Mfcc_4 Mfcc_5 Mfcc_6 Mfcc_7 ... Spectral_Contrast_3 Spectral_Contrast_4 Spectral_Contrast_5 Spectral_Contrast_6 Spectral_Contrast_7 Zero_Crossing_Rate RMS_Energy Spectral_Centroid Spectral_Bandwidth class
577 hw1_q6_810100168_female.mp3_56 female 810100168 -211.851905 132.622349 -26.187861 7.499521 -15.446996 -16.373846 -20.007774 ... 18.901962 18.813562 19.578638 19.716234 57.092843 0.086897 0.141222 1430.544644 1292.170410 5
578 hw1_q2_810100193_female.mp3_9 female 810100193 -184.267192 124.399497 5.529370 41.731653 -26.419279 -8.033805 -17.266343 ... 15.281587 16.121414 16.699920 16.846284 65.358917 0.055890 0.229206 1321.779222 1616.504355 1
579 hw1_q6_810100206_male.mp3_78 male 810100206 -264.700924 180.204484 16.873536 30.587170 -10.826312 -5.214259 -29.595162 ... 18.114748 16.465813 17.164280 16.028635 51.030329 0.051165 0.117173 893.509100 1109.506454 4
580 hw1_q6_810103054_male.mp3_22 male 810103054 -223.577594 137.644504 -7.995220 38.862804 -18.405657 14.722776 -18.204598 ... 19.636023 14.571491 18.150354 18.250443 59.959546 0.041126 0.132037 1133.692981 1511.824107 3
581 hw1_q3_810100193_female.mp3_17 female 810100193 -177.964937 146.832626 2.405349 40.226782 -23.222275 -23.508371 -36.809802 ... 19.449307 18.994886 17.612312 17.172073 59.866017 0.049638 0.214573 1117.108282 1373.884490 1

5 rows × 28 columns

Classification Process and Average Confusion Matrix¶

Calling the Classification Function¶

In [20]:
Result_DF = Run_Classification(balanced_data)
Model(KNN) using feature(time_domain_features) and feature reduction(LDA)

No description has been provided for this image
Model(KNN) using feature(time_domain_features) and feature reduction(PCA)

No description has been provided for this image
Model(KNN) using feature(time_domain_features) and feature reduction(None)

No description has been provided for this image
Model(KNN) using feature(MFCC) and feature reduction(LDA)

No description has been provided for this image
Model(KNN) using feature(MFCC) and feature reduction(PCA)

No description has been provided for this image
Model(KNN) using feature(MFCC) and feature reduction(None)

No description has been provided for this image
Model(KNN) using feature(Spectral Contrast) and feature reduction(LDA)

No description has been provided for this image
Model(KNN) using feature(Spectral Contrast) and feature reduction(PCA)

No description has been provided for this image
Model(KNN) using feature(Spectral Contrast) and feature reduction(None)

No description has been provided for this image
Model(KNN) using feature(MFCC&Sp_Contrast) and feature reduction(LDA)

No description has been provided for this image
Model(KNN) using feature(MFCC&Sp_Contrast) and feature reduction(PCA)

No description has been provided for this image
Model(KNN) using feature(MFCC&Sp_Contrast) and feature reduction(None)

No description has been provided for this image
Model(KNN) using feature(Spec_Cent&BW) and feature reduction(LDA)

No description has been provided for this image
Model(KNN) using feature(Spec_Cent&BW) and feature reduction(PCA)

No description has been provided for this image
Model(KNN) using feature(Spec_Cent&BW) and feature reduction(None)

No description has been provided for this image
Model(KNN) using feature(all_features) and feature reduction(LDA)

No description has been provided for this image
Model(KNN) using feature(all_features) and feature reduction(PCA)

No description has been provided for this image
Model(KNN) using feature(all_features) and feature reduction(None)

No description has been provided for this image
Model(Logistic Regression) using feature(time_domain_features) and feature reduction(LDA)

No description has been provided for this image
Model(Logistic Regression) using feature(time_domain_features) and feature reduction(PCA)

No description has been provided for this image
Model(Logistic Regression) using feature(time_domain_features) and feature reduction(None)

No description has been provided for this image
Model(Logistic Regression) using feature(MFCC) and feature reduction(LDA)

No description has been provided for this image
Model(Logistic Regression) using feature(MFCC) and feature reduction(PCA)

No description has been provided for this image
Model(Logistic Regression) using feature(MFCC) and feature reduction(None)

No description has been provided for this image
Model(Logistic Regression) using feature(Spectral Contrast) and feature reduction(LDA)

No description has been provided for this image
Model(Logistic Regression) using feature(Spectral Contrast) and feature reduction(PCA)

No description has been provided for this image
Model(Logistic Regression) using feature(Spectral Contrast) and feature reduction(None)

No description has been provided for this image
Model(Logistic Regression) using feature(MFCC&Sp_Contrast) and feature reduction(LDA)

No description has been provided for this image
Model(Logistic Regression) using feature(MFCC&Sp_Contrast) and feature reduction(PCA)

No description has been provided for this image
Model(Logistic Regression) using feature(MFCC&Sp_Contrast) and feature reduction(None)

No description has been provided for this image
Model(Logistic Regression) using feature(Spec_Cent&BW) and feature reduction(LDA)

No description has been provided for this image
Model(Logistic Regression) using feature(Spec_Cent&BW) and feature reduction(PCA)

No description has been provided for this image
Model(Logistic Regression) using feature(Spec_Cent&BW) and feature reduction(None)

No description has been provided for this image
Model(Logistic Regression) using feature(all_features) and feature reduction(LDA)

No description has been provided for this image
Model(Logistic Regression) using feature(all_features) and feature reduction(PCA)

No description has been provided for this image
Model(Logistic Regression) using feature(all_features) and feature reduction(None)

No description has been provided for this image
Model(SVM) using feature(time_domain_features) and feature reduction(LDA)

No description has been provided for this image
Model(SVM) using feature(time_domain_features) and feature reduction(PCA)

No description has been provided for this image
Model(SVM) using feature(time_domain_features) and feature reduction(None)

No description has been provided for this image
Model(SVM) using feature(MFCC) and feature reduction(LDA)

No description has been provided for this image
Model(SVM) using feature(MFCC) and feature reduction(PCA)

No description has been provided for this image
Model(SVM) using feature(MFCC) and feature reduction(None)

No description has been provided for this image
Model(SVM) using feature(Spectral Contrast) and feature reduction(LDA)

No description has been provided for this image
Model(SVM) using feature(Spectral Contrast) and feature reduction(PCA)

No description has been provided for this image
Model(SVM) using feature(Spectral Contrast) and feature reduction(None)

No description has been provided for this image
Model(SVM) using feature(MFCC&Sp_Contrast) and feature reduction(LDA)

No description has been provided for this image
Model(SVM) using feature(MFCC&Sp_Contrast) and feature reduction(PCA)

No description has been provided for this image
Model(SVM) using feature(MFCC&Sp_Contrast) and feature reduction(None)

No description has been provided for this image
Model(SVM) using feature(Spec_Cent&BW) and feature reduction(LDA)

No description has been provided for this image
Model(SVM) using feature(Spec_Cent&BW) and feature reduction(PCA)

No description has been provided for this image
Model(SVM) using feature(Spec_Cent&BW) and feature reduction(None)

No description has been provided for this image
Model(SVM) using feature(all_features) and feature reduction(LDA)

No description has been provided for this image
Model(SVM) using feature(all_features) and feature reduction(PCA)

No description has been provided for this image
Model(SVM) using feature(all_features) and feature reduction(None)

No description has been provided for this image

Sorting and displaying Top Results¶

In [21]:
df_sorted = Result_DF.sort_values(by='Mean Accuracy', ascending=False)
df_sorted.head()
Out[21]:
Model Feature Set Feature Reduction Mean Accuracy Std Accuracy
47 SVM MFCC&Sp_Contrast None 99.655172 0.344828
46 SVM MFCC&Sp_Contrast PCA 99.655172 0.344828
53 SVM all_features None 99.485120 0.297272
52 SVM all_features PCA 99.485120 0.297272
10 KNN MFCC&Sp_Contrast PCA 99.312707 0.002362

Ploting Results¶

In [22]:
Plot_Results_Barchart(Result_DF)
Maximum accuracy if feature reduction(LDA): 98.97%
No description has been provided for this image
Maximum accuracy if feature reduction(PCA): 99.66%
No description has been provided for this image
Maximum accuracy if feature reduction(None): 99.66%
No description has been provided for this image

Random Student Set3¶

In [23]:
import random

random.seed(1403)

# Get unique student IDs
student_ids = data['student_id'].unique()

# Randomly select 6 students
selected_students = random.sample(list(student_ids), 6)
print(f"Selected Students: {selected_students}")
Selected Students: [np.int64(810101401), np.int64(610300070), np.int64(810103054), np.int64(810100261), np.int64(810103317), np.int64(810101456)]
In [24]:
from sklearn.utils import shuffle

filtered_data = data[data['student_id'].isin(selected_students)]

# Ensure each student has an equal number of samples
balanced_data = filtered_data.groupby('student_id').apply(lambda x: x.sample(n=filtered_data['student_id'].value_counts().min(), random_state=42)).reset_index(drop=True)
balanced_data = shuffle(balanced_data, random_state=42)
balanced_data.reset_index(drop=True, inplace=True)

print("Number of samples for each student in the balanced DataFrame:")
print(balanced_data['student_id'].value_counts())
Number of samples for each student in the balanced DataFrame:
student_id
810103054    139
810103317    139
810101401    139
810100261    139
610300070    139
810101456    139
Name: count, dtype: int64
/tmp/ipykernel_15412/3536119780.py:6: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
  balanced_data = filtered_data.groupby('student_id').apply(lambda x: x.sample(n=filtered_data['student_id'].value_counts().min(), random_state=42)).reset_index(drop=True)

Classification and Analysis for Student Set3¶

In [25]:
# Map student IDs to class labels (e.g., 0 to 5)
class_mapping = {student_id: idx for idx, student_id in enumerate(selected_students)}
balanced_data['class'] = balanced_data['student_id'].map(class_mapping)

print("Balanced DataFrame with Class Column:")
balanced_data.tail()
Balanced DataFrame with Class Column:
Out[25]:
filename label student_id Mfcc_1 Mfcc_2 Mfcc_3 Mfcc_4 Mfcc_5 Mfcc_6 Mfcc_7 ... Spectral_Contrast_3 Spectral_Contrast_4 Spectral_Contrast_5 Spectral_Contrast_6 Spectral_Contrast_7 Zero_Crossing_Rate RMS_Energy Spectral_Centroid Spectral_Bandwidth class
829 hw1_q1_610300070_female.mp3_78 female 610300070 -175.235198 102.698340 -29.725138 42.060783 -30.536303 -2.644054 -35.779279 ... 22.399559 18.455236 19.463346 21.840109 63.809065 0.081916 0.253871 1622.742829 1600.531925 1
830 hw1_q5_610300070_female.mp3_3 female 610300070 -228.418644 113.546638 -1.372091 31.949328 -20.382214 -5.201409 -15.191473 ... 20.273068 16.249352 17.684594 22.089427 63.913752 0.061365 0.159174 1363.107072 1594.767166 1
831 hw1_q1_810100261_male.mp3.mp3_1 male 810100261 -210.533109 107.467373 -34.426835 39.903486 1.809619 -4.967549 -22.326968 ... 21.508174 19.111921 21.422158 26.465331 62.774750 0.108192 0.148817 1925.854474 1603.347033 3
832 hw1_q1_810101456_female.mp3_98 female 810101456 -202.050703 95.903137 -48.087167 19.064818 -20.076070 -17.911852 -34.061431 ... 24.042378 20.284681 22.179613 20.614573 65.130201 0.077171 0.185471 1654.069415 1611.754747 5
833 hw1_q1_610300070_female.mp3_18 female 610300070 -223.128404 105.228311 -8.214977 41.257315 -31.044399 4.772623 -33.162689 ... 20.590256 16.414522 19.069961 19.842611 62.274789 0.078410 0.159881 1593.538096 1701.605259 1

5 rows × 28 columns

Classification Process and Average Confusion Matrix¶

Calling the Classification Function¶

In [26]:
Result_DF = Run_Classification(balanced_data)
Model(KNN) using feature(time_domain_features) and feature reduction(LDA)

No description has been provided for this image
Model(KNN) using feature(time_domain_features) and feature reduction(PCA)

No description has been provided for this image
Model(KNN) using feature(time_domain_features) and feature reduction(None)

No description has been provided for this image
Model(KNN) using feature(MFCC) and feature reduction(LDA)

No description has been provided for this image
Model(KNN) using feature(MFCC) and feature reduction(PCA)

No description has been provided for this image
Model(KNN) using feature(MFCC) and feature reduction(None)

No description has been provided for this image
Model(KNN) using feature(Spectral Contrast) and feature reduction(LDA)

No description has been provided for this image
Model(KNN) using feature(Spectral Contrast) and feature reduction(PCA)

No description has been provided for this image
Model(KNN) using feature(Spectral Contrast) and feature reduction(None)

No description has been provided for this image
Model(KNN) using feature(MFCC&Sp_Contrast) and feature reduction(LDA)

No description has been provided for this image
Model(KNN) using feature(MFCC&Sp_Contrast) and feature reduction(PCA)

No description has been provided for this image
Model(KNN) using feature(MFCC&Sp_Contrast) and feature reduction(None)

No description has been provided for this image
Model(KNN) using feature(Spec_Cent&BW) and feature reduction(LDA)

No description has been provided for this image
Model(KNN) using feature(Spec_Cent&BW) and feature reduction(PCA)

No description has been provided for this image
Model(KNN) using feature(Spec_Cent&BW) and feature reduction(None)

No description has been provided for this image
Model(KNN) using feature(all_features) and feature reduction(LDA)

No description has been provided for this image
Model(KNN) using feature(all_features) and feature reduction(PCA)

No description has been provided for this image
Model(KNN) using feature(all_features) and feature reduction(None)

No description has been provided for this image
Model(Logistic Regression) using feature(time_domain_features) and feature reduction(LDA)

No description has been provided for this image
Model(Logistic Regression) using feature(time_domain_features) and feature reduction(PCA)

No description has been provided for this image
Model(Logistic Regression) using feature(time_domain_features) and feature reduction(None)

No description has been provided for this image
Model(Logistic Regression) using feature(MFCC) and feature reduction(LDA)

No description has been provided for this image
Model(Logistic Regression) using feature(MFCC) and feature reduction(PCA)

No description has been provided for this image
Model(Logistic Regression) using feature(MFCC) and feature reduction(None)

No description has been provided for this image
Model(Logistic Regression) using feature(Spectral Contrast) and feature reduction(LDA)

No description has been provided for this image
Model(Logistic Regression) using feature(Spectral Contrast) and feature reduction(PCA)

No description has been provided for this image
Model(Logistic Regression) using feature(Spectral Contrast) and feature reduction(None)

No description has been provided for this image
Model(Logistic Regression) using feature(MFCC&Sp_Contrast) and feature reduction(LDA)

No description has been provided for this image
Model(Logistic Regression) using feature(MFCC&Sp_Contrast) and feature reduction(PCA)

No description has been provided for this image
Model(Logistic Regression) using feature(MFCC&Sp_Contrast) and feature reduction(None)

No description has been provided for this image
Model(Logistic Regression) using feature(Spec_Cent&BW) and feature reduction(LDA)

No description has been provided for this image
Model(Logistic Regression) using feature(Spec_Cent&BW) and feature reduction(PCA)

No description has been provided for this image
Model(Logistic Regression) using feature(Spec_Cent&BW) and feature reduction(None)

No description has been provided for this image
Model(Logistic Regression) using feature(all_features) and feature reduction(LDA)

No description has been provided for this image
Model(Logistic Regression) using feature(all_features) and feature reduction(PCA)

No description has been provided for this image
Model(Logistic Regression) using feature(all_features) and feature reduction(None)

No description has been provided for this image
Model(SVM) using feature(time_domain_features) and feature reduction(LDA)

No description has been provided for this image
Model(SVM) using feature(time_domain_features) and feature reduction(PCA)

No description has been provided for this image
Model(SVM) using feature(time_domain_features) and feature reduction(None)

No description has been provided for this image
Model(SVM) using feature(MFCC) and feature reduction(LDA)

No description has been provided for this image
Model(SVM) using feature(MFCC) and feature reduction(PCA)

No description has been provided for this image
Model(SVM) using feature(MFCC) and feature reduction(None)

No description has been provided for this image
Model(SVM) using feature(Spectral Contrast) and feature reduction(LDA)

No description has been provided for this image
Model(SVM) using feature(Spectral Contrast) and feature reduction(PCA)

No description has been provided for this image
Model(SVM) using feature(Spectral Contrast) and feature reduction(None)

No description has been provided for this image
Model(SVM) using feature(MFCC&Sp_Contrast) and feature reduction(LDA)

No description has been provided for this image
Model(SVM) using feature(MFCC&Sp_Contrast) and feature reduction(PCA)

No description has been provided for this image
Model(SVM) using feature(MFCC&Sp_Contrast) and feature reduction(None)

No description has been provided for this image
Model(SVM) using feature(Spec_Cent&BW) and feature reduction(LDA)

No description has been provided for this image
Model(SVM) using feature(Spec_Cent&BW) and feature reduction(PCA)

No description has been provided for this image
Model(SVM) using feature(Spec_Cent&BW) and feature reduction(None)

No description has been provided for this image
Model(SVM) using feature(all_features) and feature reduction(LDA)

No description has been provided for this image
Model(SVM) using feature(all_features) and feature reduction(PCA)

No description has been provided for this image
Model(SVM) using feature(all_features) and feature reduction(None)

No description has been provided for this image

Sorting and displaying Top Results¶

In [27]:
df_sorted = Result_DF.sort_values(by='Mean Accuracy', ascending=False)
df_sorted.head()
Out[27]:
Model Feature Set Feature Reduction Mean Accuracy Std Accuracy
9 KNN MFCC&Sp_Contrast LDA 99.880383 0.207183
51 SVM all_features LDA 99.880383 0.207183
33 Logistic Regression all_features LDA 99.880383 0.207183
34 Logistic Regression all_features PCA 99.880383 0.207183
35 Logistic Regression all_features None 99.880383 0.207183

Ploting Results¶

In [28]:
Plot_Results_Barchart(Result_DF)
Maximum accuracy if feature reduction(LDA): 99.88%
No description has been provided for this image
Maximum accuracy if feature reduction(PCA): 99.88%
No description has been provided for this image
Maximum accuracy if feature reduction(None): 99.88%
No description has been provided for this image
In [ ]: